We have chosen Copenhagen as our city. Based on the analysis of the data set we conclude that the best predictors of Airbnb prices for a 4 night stay for 2 people in Copenhagen are:
To build the model that best predicts price, we first analysed the data set and selected variables that should drive prices from a logical point of view; these formed our base model. Starting from this base model, we examined the correlation of all the other variables with price and added those that were correlated with price and improved the model. Although more variables than the 6 variables included above were correlated with price, most of them did not significantly affect the price, most likely because they are themselves correlated with variables already in the model, so we omitted them.
In conclusion, the model that best predicts prices is log(price_4_nights) = prop_type_simplified + number_of_reviews + review_scores_rating + room_type + accommodates + neighbourhood_cleansed_simplified + availability_30 + reviews_per_month. It explains 52% of the variation in the prices of Airbnb rentals for a 4 night stay for 2 people in Copenhagen.
Interpretation: Number of rows(observations): 9625; Number of columns(variables): 74; Number of numeric variables: 37; Number of character variables: 23; Number of date variables: 5; Number of logical variables (factor variables): 9.
Note: logical variables have a fixed or known set of possible values, thus they are also factor variables, for example, host_is_superhost(TRUE/FALSE), host_has_profile_pic(TRUE/FALSE)
# variables / columns
dplyr::glimpse(listings)
Rows: 9,625
Columns: 74
$ id <dbl> 6983, 26057, 29118, 31094…
$ listing_url <chr> "https://www.airbnb.com/r…
$ scrape_id <dbl> 2.021093e+13, 2.021093e+1…
$ last_scraped <date> 2021-09-30, 2021-09-30, …
$ name <chr> "Copenhagen 'N Livin'", "…
$ description <chr> "Lovely apartment located…
$ neighborhood_overview <chr> "Nice bars and cozy cafes…
$ picture_url <chr> "https://a0.muscache.com/…
$ host_id <dbl> 16774, 109777, 125230, 12…
$ host_url <chr> "https://www.airbnb.com/u…
$ host_name <chr> "Simon", "Kari", "Nana", …
$ host_since <date> 2009-05-12, 2010-04-17, …
$ host_location <chr> "Copenhagen, Capital Regi…
$ host_about <chr> "I'm currently working as…
$ host_response_time <chr> "N/A", "N/A", "within a f…
$ host_response_rate <chr> "N/A", "N/A", "100%", "N/…
$ host_acceptance_rate <chr> "N/A", "N/A", "50%", "0%"…
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALS…
$ host_thumbnail_url <chr> "https://a0.muscache.com/…
$ host_picture_url <chr> "https://a0.muscache.com/…
$ host_neighbourhood <chr> "Nørrebro", "Indre By", "…
$ host_listings_count <dbl> 1, 1, 1, 1, 3, 1, 0, 1, 2…
$ host_total_listings_count <dbl> 1, 1, 1, 1, 3, 1, 0, 1, 2…
$ host_verifications <chr> "['email', 'phone', 'revi…
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified <lgl> FALSE, TRUE, TRUE, TRUE, …
$ neighbourhood <chr> "Copenhagen, Hovedstaden,…
$ neighbourhood_cleansed <chr> "Nrrebro", "Indre By", "V…
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude <dbl> 55.68641, 55.69196, 55.67…
$ longitude <dbl> 12.54741, 12.57637, 12.55…
$ property_type <chr> "Private room in rental u…
$ room_type <chr> "Private room", "Entire h…
$ accommodates <dbl> 2, 6, 2, 3, 5, 4, 4, 4, 1…
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text <chr> "1 shared bath", "1.5 bat…
$ bedrooms <dbl> 1, 4, 1, 1, 3, 2, 1, 2, N…
$ beds <dbl> 1, 4, 1, 3, 4, 2, 1, 3, 1…
$ amenities <chr> "[\"Cooking basics\", \"W…
$ price <chr> "$370.00", "$2,400.00", "…
$ minimum_nights <dbl> 2, 4, 7, 2, 3, 100, 6, 5,…
$ maximum_nights <dbl> 15, 1125, 14, 10, 365, 11…
$ minimum_minimum_nights <dbl> 2, 4, 3, 2, 3, 100, 6, 5,…
$ maximum_minimum_nights <dbl> 2, 4, 5, 2, 3, 100, 6, 5,…
$ minimum_maximum_nights <dbl> 15, 1125, 14, 10, 365, 11…
$ maximum_maximum_nights <dbl> 15, 1125, 14, 10, 365, 11…
$ minimum_nights_avg_ntm <dbl> 2.0, 4.0, 4.1, 2.0, 3.0, …
$ maximum_nights_avg_ntm <dbl> 15, 1125, 14, 10, 365, 11…
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30 <dbl> 0, 17, 0, 0, 7, 0, 7, 23,…
$ availability_60 <dbl> 0, 45, 0, 0, 10, 0, 23, 5…
$ availability_90 <dbl> 0, 69, 15, 0, 10, 14, 36,…
$ availability_365 <dbl> 0, 340, 101, 0, 12, 289, …
$ calendar_last_scraped <date> 2021-09-30, 2021-09-30, …
$ number_of_reviews <dbl> 168, 51, 22, 17, 75, 7, 7…
$ number_of_reviews_ltm <dbl> 0, 1, 0, 0, 2, 0, 0, 0, 0…
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_review <date> 2013-01-02, 2016-02-06, …
$ last_review <date> 2018-11-23, 2019-12-14, …
$ review_scores_rating <dbl> 4.78, 4.90, 4.91, 4.87, 4…
$ review_scores_accuracy <dbl> 4.78, 4.91, 4.85, 4.80, 4…
$ review_scores_cleanliness <dbl> 4.78, 4.96, 4.77, 4.87, 4…
$ review_scores_checkin <dbl> 4.87, 4.91, 5.00, 4.85, 4…
$ review_scores_communication <dbl> 4.90, 4.83, 5.00, 4.80, 4…
$ review_scores_location <dbl> 4.72, 4.96, 4.85, 4.85, 4…
$ review_scores_value <dbl> 4.71, 4.80, 4.77, 4.46, 4…
$ license <lgl> NA, NA, NA, NA, NA, NA, N…
$ instant_bookable <lgl> FALSE, FALSE, FALSE, FALS…
$ calculated_host_listings_count <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ calculated_host_listings_count_entire_homes <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1…
$ calculated_host_listings_count_private_rooms <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month <dbl> 1.58, 0.74, 0.35, 0.26, 0…
You can find a full data dictionary here. Below are the definitions of some of the most important variables:
price = cost per night in Danish krone
property_type: type of accommodation (House, Apartment, etc.)
room_type:
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 5 in this data set)
longitude , latitude: geographical coordinates to help us locate the listing
neighbourhood: three variables on a few major neighborhoods in each city
## converting 'price' to a numeric variable
listings_clean <- listings %>%
  mutate(price = readr::parse_number(price)) %>%
  ## dropping non-numeric characters before/after the first number from variable 'bathrooms'
  mutate(bathrooms_text = replace(bathrooms_text, bathrooms_text == "Shared half-bath", 0.5)) %>%
  mutate(bathrooms_text = replace(bathrooms_text, bathrooms_text == "Half-bath", 0.5)) %>%
  mutate(bathrooms_text = replace(bathrooms_text, bathrooms_text == "Private half-bath", 0.5)) %>%
  ## converting 'bathrooms' to a numeric variable
  mutate(bathrooms = readr::parse_number(bathrooms_text))
# check price is a number
typeof(listings_clean$price)
[1] "double"
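As a minimal sketch of what this cleaning step does, the toy vectors below (invented for illustration, not taken from the data set) show how `readr::parse_number()` handles price strings and why the half-bath labels need special treatment: they contain no digits, so they would otherwise parse to NA. The `grepl()` shortcut is an assumption of this sketch rather than the approach used in the chunk above.

```r
library(readr)

# Toy inputs, invented for illustration
price_raw      <- c("$370.00", "$2,400.00")
bathrooms_text <- c("1 shared bath", "1.5 baths", "Half-bath", "Shared half-bath")

# parse_number() drops the currency symbol and the thousands separator
price_num <- parse_number(price_raw)

# Labels such as "Half-bath" contain no digits, so map them to 0.5 explicitly;
# suppressWarnings() hides the parsing failures for those labels
is_half  <- grepl("half-bath", bathrooms_text, ignore.case = TRUE)
bath_num <- ifelse(is_half, 0.5, suppressWarnings(parse_number(bathrooms_text)))

print(price_num)
print(bath_num)
```

Matching on the pattern rather than on three exact labels would also catch spelling variants, at the cost of being less explicit about which labels are recoded.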
skimr::skim(listings)
| Name | listings |
| Number of rows | 9625 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 23 |
| Date | 5 |
| logical | 9 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 9625 | 0 |
| name | 1 | 1.00 | 1 | 248 | 0 | 9357 | 0 |
| description | 329 | 0.97 | 2 | 1000 | 0 | 9179 | 0 |
| neighborhood_overview | 4242 | 0.56 | 4 | 1000 | 0 | 5163 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 9525 | 0 |
| host_url | 0 | 1.00 | 39 | 43 | 0 | 8452 | 0 |
| host_name | 3 | 1.00 | 1 | 28 | 0 | 2808 | 0 |
| host_location | 19 | 1.00 | 2 | 119 | 0 | 397 | 0 |
| host_about | 4267 | 0.56 | 1 | 6639 | 0 | 4464 | 10 |
| host_response_time | 3 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 3 | 1.00 | 2 | 4 | 0 | 62 | 0 |
| host_acceptance_rate | 3 | 1.00 | 2 | 4 | 0 | 97 | 0 |
| host_thumbnail_url | 3 | 1.00 | 55 | 106 | 0 | 8384 | 0 |
| host_picture_url | 3 | 1.00 | 57 | 109 | 0 | 8384 | 0 |
| host_neighbourhood | 4118 | 0.57 | 5 | 20 | 0 | 29 | 0 |
| host_verifications | 0 | 1.00 | 2 | 158 | 0 | 252 | 0 |
| neighbourhood | 4242 | 0.56 | 7 | 55 | 0 | 181 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 25 | 0 | 11 | 0 |
| property_type | 0 | 1.00 | 4 | 35 | 0 | 47 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 15 | 1.00 | 6 | 17 | 0 | 21 | 0 |
| amenities | 0 | 1.00 | 2 | 1612 | 0 | 9308 | 0 |
| price | 0 | 1.00 | 5 | 11 | 0 | 1387 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-30 | 2021-09-30 | 2021-09-30 | 1 |
| host_since | 3 | 1.00 | 2009-05-12 | 2021-09-27 | 2015-06-26 | 3040 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-30 | 2021-09-30 | 2021-09-30 | 1 |
| first_review | 1382 | 0.86 | 2011-07-09 | 2021-09-29 | 2019-01-28 | 2121 |
| last_review | 1382 | 0.86 | 2011-07-21 | 2021-09-30 | 2019-12-23 | 1551 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 3 | 1 | 0.14 | FAL: 8269, TRU: 1353 |
| host_has_profile_pic | 3 | 1 | 0.99 | TRU: 9556, FAL: 66 |
| host_identity_verified | 3 | 1 | 0.79 | TRU: 7607, FAL: 2015 |
| neighbourhood_group_cleansed | 9625 | 0 | NaN | : |
| bathrooms | 9625 | 0 | NaN | : |
| calendar_updated | 9625 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.98 | TRU: 9432, FAL: 193 |
| license | 9625 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.21 | FAL: 7643, TRU: 1982 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 2.720424e+07 | 16424268.66 | 6.983000e+03 | 1.307881e+07 | 2.730204e+07 | 4.201887e+07 | 5.251236e+07 | ▇▆▅▆▇ |
| scrape_id | 0 | 1.00 | 2.021093e+13 | 0.00 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 8.435638e+07 | 102624593.48 | 1.677400e+04 | 1.050400e+07 | 3.627811e+07 | 1.324883e+08 | 4.248214e+08 | ▇▂▁▁▁ |
| host_listings_count | 3 | 1.00 | 1.077000e+01 | 53.03 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.460000e+02 | ▇▁▁▁▁ |
| host_total_listings_count | 3 | 1.00 | 1.077000e+01 | 53.03 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.460000e+02 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 5.568000e+01 | 0.02 | 5.562000e+01 | 5.567000e+01 | 5.568000e+01 | 5.569000e+01 | 5.573000e+01 | ▁▃▇▆▁ |
| longitude | 0 | 1.00 | 1.256000e+01 | 0.03 | 1.245000e+01 | 1.254000e+01 | 1.256000e+01 | 1.258000e+01 | 1.264000e+01 | ▁▂▇▆▂ |
| accommodates | 0 | 1.00 | 3.450000e+00 | 1.77 | 0.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▆▁▁▁ |
| bedrooms | 218 | 0.98 | 1.680000e+00 | 1.39 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.010000e+02 | ▇▁▁▁▁ |
| beds | 63 | 0.99 | 2.060000e+00 | 1.51 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 2.500000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 4.590000e+00 | 20.84 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.111000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 5.637100e+02 | 536.17 | 1.000000e+00 | 2.000000e+01 | 3.650000e+02 | 1.125000e+03 | 4.000000e+03 | ▇▇▁▁▁ |
| minimum_minimum_nights | 1 | 1.00 | 4.610000e+00 | 20.85 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.111000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 1 | 1.00 | 5.120000e+00 | 25.63 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.400000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 1 | 1.00 | 6.481200e+02 | 534.59 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 4.000000e+03 | ▆▇▁▁▁ |
| maximum_maximum_nights | 1 | 1.00 | 6.578100e+02 | 532.65 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 4.000000e+03 | ▆▇▁▁▁ |
| minimum_nights_avg_ntm | 1 | 1.00 | 4.770000e+00 | 21.01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 4.000000e+00 | 1.111000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 1 | 1.00 | 6.543700e+02 | 532.30 | 1.000000e+00 | 2.800000e+01 | 1.125000e+03 | 1.125000e+03 | 4.000000e+03 | ▆▇▁▁▁ |
| availability_30 | 0 | 1.00 | 5.930000e+00 | 9.59 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.000000e+00 | 3.000000e+01 | ▇▁▁▁▁ |
| availability_60 | 0 | 1.00 | 1.374000e+01 | 20.60 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.500000e+01 | 6.000000e+01 | ▇▁▁▁▂ |
| availability_90 | 0 | 1.00 | 2.293000e+01 | 32.05 | 0.000000e+00 | 0.000000e+00 | 2.000000e+00 | 4.400000e+01 | 9.000000e+01 | ▇▁▁▁▂ |
| availability_365 | 0 | 1.00 | 1.007200e+02 | 125.83 | 0.000000e+00 | 0.000000e+00 | 3.100000e+01 | 1.790000e+02 | 3.650000e+02 | ▇▂▁▁▂ |
| number_of_reviews | 0 | 1.00 | 1.984000e+01 | 35.88 | 0.000000e+00 | 2.000000e+00 | 8.000000e+00 | 2.300000e+01 | 6.600000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 2.200000e+00 | 5.12 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 1.680000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 4.000000e-01 | 1.12 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.700000e+01 | ▇▁▁▁▁ |
| review_scores_rating | 1382 | 0.86 | 4.740000e+00 | 0.55 | 0.000000e+00 | 4.680000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 1457 | 0.85 | 4.830000e+00 | 0.29 | 1.000000e+00 | 4.780000e+00 | 4.920000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 1457 | 0.85 | 4.690000e+00 | 0.41 | 1.000000e+00 | 4.560000e+00 | 4.800000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 1457 | 0.85 | 4.880000e+00 | 0.27 | 1.000000e+00 | 4.860000e+00 | 4.970000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 1457 | 0.85 | 4.900000e+00 | 0.27 | 1.000000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 1458 | 0.85 | 4.820000e+00 | 0.27 | 1.000000e+00 | 4.750000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 1458 | 0.85 | 4.710000e+00 | 0.33 | 1.000000e+00 | 4.600000e+00 | 4.770000e+00 | 4.920000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 6.030000e+00 | 26.19 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.830000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 5.710000e+00 | 26.22 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.830000e+02 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 2.900000e-01 | 0.94 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.10 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 1382 | 0.86 | 8.500000e-01 | 1.26 | 1.000000e-02 | 2.100000e-01 | 4.600000e-01 | 9.700000e-01 | 2.600000e+01 | ▇▁▁▁▁ |
Interpretation:
We have 23 character variables, many of which describe the property hosts, such as their name, location, response rate, picture and bio. Four variables relate to the neighborhood of the properties; three of them (neighborhood_overview, host_neighbourhood and neighbourhood) have more than 4000 missing values, while neighbourhood_cleansed is complete. We can also observe that over 4000 hosts have not filled in their “about” section. Most importantly, we see that ‘price’ is stored as a character variable instead of a numeric one.
The logical variables take a fixed, known set of values (TRUE and FALSE). We observe that 4 logical variables have 9625 missing values, which means they are completely empty, since the data set has 9625 rows in total. These variables are neighbourhood_group_cleansed, bathrooms, calendar_updated and license.
Among the numeric variables, more than 1400 values are missing from each of the six review sub-score variables (and 1382 from review_scores_rating), which indicates that roughly 15% of the listings have no review scores. We also see that 218 properties do not have their number of bedrooms listed, which is an important variable for explaining the price of the property.
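The missing-value counts discussed here can also be checked without `skimr`. The following is a small base R sketch on an invented data frame (the column names mirror the data set, the values do not); it reports the number of missing values, the completion rate, and which columns are entirely empty.

```r
# Toy data frame, invented for illustration
toy <- data.frame(
  bedrooms             = c(1, NA, 2, 3),
  review_scores_rating = c(4.8, NA, NA, 5.0),
  license              = c(NA, NA, NA, NA)
)

n_missing     <- colSums(is.na(toy))        # missing values per column
complete_rate <- 1 - n_missing / nrow(toy)  # share of non-missing values

# Columns where every row is missing carry no information
empty_cols <- names(n_missing)[n_missing == nrow(toy)]

print(data.frame(n_missing, complete_rate))
print(empty_cols)
```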
Note: We are examining 5 variables of interest, namely: 1. price 2. number of beds 3. room capacity 4. minimum nights 5. maximum nights
For now, we have chosen these variables based on their potential ability to explain our target variable \(Y\), the cost for 2 people to stay at an Airbnb location for 4 nights.
mosaic::favstats(listings_clean$price)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 609 | 850 | 1.2e+03 | 1e+05 | 1.09e+03 | 2.13e+03 | 9625 | 0 |
Interpretation: Since mean > median, the distribution of ‘price’ is positively skewed. The minimum price is 0, which is odd because a property cannot be listed for free.
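The mean-versus-median comparison used in these interpretations can be reproduced in a few lines. The sketch below uses a made-up price vector (not the actual listings) and also counts zero prices, which deserve a closer look before modelling.

```r
# Toy nightly prices in DKK, invented for illustration
price <- c(0, 609, 850, 1200, 100000)

mean_price   <- mean(price)
median_price <- median(price)

# A mean well above the median suggests a long right tail (positive skew)
right_skewed <- mean_price > median_price

# Listings with a price of 0 are suspicious and worth inspecting
n_zero <- sum(price == 0)

print(c(mean = mean_price, median = median_price))
print(right_skewed)
print(n_zero)
```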
mosaic::favstats(listings$beds)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 25 | 2.06 | 1.51 | 9562 | 63 |
Interpretation: Since mean > median, the distribution of ‘beds’ is positively skewed. The minimum number of beds is 0, which is odd because every listed property should have at least 1 available bed. The maximum number of beds is 25, while the median is 2, which points to outliers on the high side.
mosaic::favstats(listings$accommodates)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3 | 4 | 16 | 3.45 | 1.77 | 9625 | 0 |
Interpretation: Since mean > median, the distribution of ‘accommodates’ is positively skewed. We want to calculate the price of 4 nights for 2 people, hence all rooms that accommodate fewer than 2 people are not of interest. The minimum room capacity is 0 and Q1 is 2, so at most a quarter of the listings fall below this threshold and are excluded from our analysis.
mosaic::favstats(listings$minimum_nights)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 3 | 4 | 1.11e+03 | 4.59 | 20.8 | 9625 | 0 |
Interpretation: Since mean > median, the distribution of ‘minimum_nights’ is positively skewed. Our target variable requires the price of properties for 4 nights, thus we exclude all listings with a minimum_nights requirement of more than 4 from our analysis. The maximum value of minimum_nights is 1111, which is odd because there are only 365 days in a year.
mosaic::favstats(listings$maximum_nights)
| min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|
| 1 | 20 | 365 | 1.12e+03 | 4e+03 | 564 | 536 | 9625 | 0 |
Interpretation: Since mean > median, the distribution of ‘maximum_nights’ is positively skewed. We need maximum_nights to exclude properties that have maximum_nights less than 4, because our target variable requires properties that allow a stay of at least 4 nights.
listings_clean %>%
  ggplot(aes(x = price)) +
  # binwidth belongs in geom_histogram(); ggplot() silently ignores it
  geom_histogram(binwidth = 10) +
  theme_minimal() +
  ggtitle("Histogram of price") +
  NULL
listings_clean %>%
  filter(price <= 3500) %>%
  ggplot(aes(x = price)) +
  geom_histogram(binwidth = 10, alpha = 0.7, colour = "black") +
  theme_bw() +
  ggtitle("Histogram of price less than 3500") +
  NULL
library(purrr)
listings_numeric <- listings %>%
select("minimum_nights", "maximum_nights", "review_scores_rating", "number_of_reviews")%>%
filter(minimum_nights<=30)%>%
filter(maximum_nights<=1600) %>%
filter(number_of_reviews<=200)
listings_numeric %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram(color="black", fill="pink")+
theme_bw()+
geom_density(alpha=0.5)+
NULL
Logical variables that are factor variables: 1. host_is_superhost 2. host_has_profile_pic 3. host_identity_verified 4. has_availability 5. instant_bookable
Character variables that are factor variables: 1. property_type 2. room_type 3. bathrooms_text 4. host_response_time 5. host_neighbourhood
# plotting bar graph of host_is_superhost
listings%>%
filter(!is.na(host_is_superhost))%>%
ggplot(aes(x=host_is_superhost))+
geom_bar()+
theme_bw()
# plotting bar graph of instant_bookable
listings%>%
filter(!is.na(instant_bookable))%>%
ggplot(aes(x=instant_bookable))+
geom_bar()+
theme_bw()
## review_scores_rating with different host_is_superhost levels
listings%>%
filter(!is.na(host_is_superhost))%>%
ggplot(aes(x=host_is_superhost, y=review_scores_rating))+
geom_col()+
theme_bw()
## number_of_reviews with different host_identity_verified levels
listings%>%
filter(!is.na(host_identity_verified))%>%
ggplot(aes(x=host_identity_verified, y=number_of_reviews)) +
geom_col() +
theme_bw()
Interpretation:
The correlations between variables are surprisingly low, possibly because the numerical data fail to capture intangible factors such as location and marketing (e.g. quality of the description). The variable review_scores_location has a correlation of only about 0.025 with price; this is weak, but the relationship could be explored later through a regression.
As seen from the chunk of code below, variables relating to reviews do not correlate with a higher price. Logically, this appears reasonable, as a property with many reviews and a high overall rating could be a low-priced property and vice versa. The correlation between the different review sub-scores (e.g. review_scores_cleanliness and review_scores_communication) is also surprisingly low at 0.51.
Some variables are closely tied to one another. For instance, “accommodates” and “beds” have a correlation of 74.6%. This is sensible, as the number of guests a listing accommodates depends largely on its beds and bedroom layout. The variables closest to a perfect correlation, at 0.999, are “calculated_host_listings_count” and “calculated_host_listings_count_entire_homes”.
listings_clean %>%
select(c(price, number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_communication,review_scores_value,review_scores_location,reviews_per_month)) %>% #we expected these variables to be strongly correlated however they are not
ggpairs(alpha = 0.4)
Interpretation: From the chunk below, we find stronger correlations between price and accommodates, beds, bedrooms, availability_30, availability_60, availability_90, and availability_365. However, the variables within group (1) accommodates, beds and bedrooms and within group (2) availability_30, availability_60, availability_90, and availability_365 are highly correlated with each other. Therefore, we attempt to remove the noise by selecting only accommodates (15.1% correlation to price) and availability_30 (10.8% correlation to price) in the next graph.
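The pairwise correlations read off the plots can also be computed directly. The sketch below uses base R's `cor()` on a small invented data frame; the 0.7 cut-off for flagging highly correlated pairs is a hypothetical choice for illustration, not a rule used in this analysis.

```r
# Toy data, invented for illustration (note the missing value in 'beds')
toy <- data.frame(
  price        = c(500, 800, 900, 1500, 2000, 1200),
  accommodates = c(2, 3, 4, 6, 8, 4),
  beds         = c(1, 2, 2, 4, NA, 3)
)

# Pairwise-complete observations keep rows that are complete for each pair
cor_mat <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_mat, 3))

# Flag pairs of distinct variables above a hypothetical threshold
threshold <- 0.7
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

When two predictors appear together in `high_pairs`, keeping only one of them in the regression is a common way to limit multicollinearity.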
listings_clean %>%
select(c(price, accommodates, beds, bedrooms, availability_30, availability_60,availability_90,availability_365)) %>%
ggpairs(alpha = 0.4)
Interpretation: Finally, we test other variables, which we do not suspect to have a great influence on price. Surprisingly, we find that “calculated_host_listings_count” and “calculated_host_listings_count_entire_homes” have an almost perfect correlation of 0.999.
listings_clean %>%
select(c(price, number_of_reviews, number_of_reviews_ltm, number_of_reviews_l30d, calculated_host_listings_count, calculated_host_listings_count_entire_homes, calculated_host_listings_count_private_rooms, calculated_host_listings_count_shared_rooms, reviews_per_month)) %>%
ggpairs(alpha = 0.4)
head(listings_clean %>%
group_by(property_type) %>%
summarise(count=n()) %>%
mutate(proportion=count/sum(count)) %>%
arrange(desc(count))
)
| property_type | count | proportion |
|---|---|---|
| Entire rental unit | 5803 | 0.603 |
| Entire condominium (condo) | 1200 | 0.125 |
| Private room in rental unit | 972 | 0.101 |
| Entire residential home | 465 | 0.0483 |
| Entire serviced apartment | 274 | 0.0285 |
| Entire townhouse | 206 | 0.0214 |
Interpretation: We can see that the four most common property types are “Entire rental unit”, “Entire condominium (condo)”, “Private room in rental unit” and “Entire residential home”. Together they account for 8,440 listings, which corresponds to 87.7% of the total. For simplicity, we group the remaining ~12% of listings into the category “Other” by creating a new variable called “prop_type_simplified”.
listings_prop <- listings_clean %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Private room in residential home", "Entire residential home","Entire condominium (condo)") ~ property_type,
TRUE ~ "Other"
))
head(listings_prop %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
)
| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 5803 |
| Entire condominium (condo) | Entire condominium (condo) | 1200 |
| Private room in rental unit | Other | 972 |
| Entire residential home | Entire residential home | 465 |
| Entire serviced apartment | Other | 274 |
| Entire townhouse | Other | 206 |
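Interpretation: The table above shows how property types are mapped to the prop_type_simplified variable. Note that the code keeps “Private room in residential home” rather than “Private room in rental unit”, so “Private room in rental unit” (the third most common type, with 972 listings) is grouped into “Other” together with the less common property types.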
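One way to avoid hand-typed category names is to derive the most common types from the counts themselves. The sketch below does this in base R on an invented vector of property types; it is an alternative to the `case_when()` approach, not the code used to produce the table above, and `n_keep` is a name introduced here for illustration.

```r
# Toy property types, invented for illustration
property_type <- c(
  rep("Entire rental unit", 6),
  rep("Entire condominium (condo)", 4),
  rep("Private room in rental unit", 3),
  rep("Entire residential home", 2),
  "Entire townhouse", "Boat"
)

n_keep    <- 4                                        # number of types to keep
counts    <- sort(table(property_type), decreasing = TRUE)
top_types <- names(counts)[seq_len(n_keep)]

# Everything outside the top n_keep types is lumped into "Other"
prop_type_simplified <- ifelse(property_type %in% top_types, property_type, "Other")

print(table(prop_type_simplified))
```

If two types tie for the last kept position, the order returned by `sort()` decides which one is kept, so ties are worth checking by hand.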
# barplot for property_type vs average price
listings_prop %>%
group_by(prop_type_simplified)%>%
summarise(average_price = mean(price)) %>%
ggplot(aes(x=prop_type_simplified,y=average_price))+
geom_col()+
ggtitle("Average Property Price vs Property Type")+
geom_text(aes(label = c(1061.12,1125.19,1354.68,954.92,473.51)), vjust = 1.5, colour = "white")+
NULL
Interpretation: The bar chart above shows the average rental price of properties in each category of the variable prop_type_simplified. Looking at the resulting data, we see that properties of the type “Entire residential home” have the highest average price (about DKK 1,355 per night) and “Private room in residential home” has the lowest average price (about DKK 474 per night). These observations make intuitive sense, as an entire residential home should cost more than a single room in the same type of home.
listings_prop %>%
select(price,review_scores_rating, review_scores_location, review_scores_value,
number_of_reviews, reviews_per_month,
bedrooms,beds, availability_365) %>%
ggpairs(alpha=0.5)+
theme_bw()
Interpretation: “review_scores_value” and “review_scores_rating” have the highest correlation, at 74.1%. This result makes sense, as the two variables measure much the same thing, namely the visitors’ satisfaction with the accommodation.
Questions:
- What are the most common values for the variable minimum_nights?
- Is there any value among the common values that stands out?
- What is the likely intended purpose for Airbnb listings with this seemingly unusual value for minimum_nights?
Filter the airbnb data so that it only includes observations with minimum_nights <= 4
#most common value for minimum_nights
listings_prop %>%
group_by(minimum_nights) %>%
count(sort=TRUE)
# A tibble: 66 × 2
# Groups: minimum_nights [66]
minimum_nights n
<dbl> <int>
1 2 2871
2 3 2149
3 1 1757
4 4 1017
5 5 790
6 7 368
7 6 231
8 14 86
9 30 75
10 10 46
# … with 56 more rows
summary(as.factor(listings$minimum_nights))
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1757 2871 2149 1017 790 231 368 13 4 46 3 13 10 86 19 6
18 19 20 21 22 23 25 27 28 29 30 31 35 36 39 40
3 1 25 13 2 1 8 1 11 4 75 4 4 1 1 3
44 45 49 50 56 60 61 66 69 70 75 80 85 89 90 92
1 4 1 5 1 18 1 1 1 3 1 2 1 1 17 1
99 100 105 110 120 150 160 170 180 200 300 360 365 400 500 600
1 3 1 1 1 2 1 1 4 2 2 1 1 1 1 1
1000 1111
1 1
Interpretation: The most common value for “minimum_nights” among all listings is 2 nights. The value that stands out among the common ones is 30 nights: it is quite surprising that listings intended for travel require a minimum stay of 30 nights, considering that most countries and companies worldwide do not even allow their employees that many vacation days. Most likely, these Airbnb listings are intended for people who live in the city only temporarily, for example for an internship, and need accommodation for that period.
#filter for listings with a minimum stay of 4 nights or fewer
listings_less_than_4 <- listings_prop %>%
filter(minimum_nights <= 4)
#correlation (ggpairs with the filter for less than or equal to 4 nights)
listings_less_than_4 %>%
select(price,review_scores_rating, review_scores_location, review_scores_value,
number_of_reviews, reviews_per_month,
bedrooms,beds, availability_365) %>%
ggpairs(alpha=0.5)+
theme_bw()
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
listings%>%
filter(minimum_nights<=4)%>%
group_by(neighbourhood_cleansed)%>%
count()
# A tibble: 11 × 2
# Groups: neighbourhood_cleansed [11]
neighbourhood_cleansed n
<chr> <int>
1 Amager st 574
2 Amager Vest 739
3 Bispebjerg 276
4 Brnshj-Husum 143
5 Frederiksberg 780
6 Indre By 1414
7 Nrrebro 1250
8 sterbro 799
9 Valby 314
10 Vanlse 184
11 Vesterbro-Kongens Enghave 1321
listings%>%
filter(minimum_nights<=4)%>%
ggplot(aes(x=neighbourhood_cleansed))+
geom_bar()+
theme(axis.text.x=element_text(angle=70, size=8, vjust=0.6))+
labs(title="Distribution of Airbnb homes among neighborhoods",x="Neighbourhood",y="Count")+
NULL
Interpretation: Looking at the map and the distribution of listings across all neighborhoods, we can observe that, unsurprisingly, listings are more concentrated in the city center and become scarcer the further we move out. Indre By has the most Airbnb listings of all (1414 listings), followed by Nørrebro (shown as “Nrrebro” in the data) with 1250 Airbnb listings.
listings_clean_map<-listings_clean%>%
group_by(neighbourhood_cleansed) %>%
summarise_at(vars(price),
list(name = mean))
listings_clean_map %>%
  ggplot(aes(x = neighbourhood_cleansed, y = name)) +
  geom_point() +
  # theme_bw() comes first so that it does not reset the axis-text settings below
  theme_bw() +
  theme(axis.text.x = element_text(angle = 70, size = 8, vjust = 0.6)) +
  labs(title = "Average rental price per night in each neighbourhood", x = "Neighbourhood", y = "Average price per night") +
  geom_text(aes(label = c(1000.77,1101.55,695.19,815.97,1143.60,1488.38,885.83,1071.06,921.62,841.43,1050.33)), vjust = 1.5, colour = "black") +
  NULL
listings_clean%>%
ggplot(aes(x=neighbourhood_cleansed))+
geom_boxplot(aes(y= price))+
labs(title="Range of rental price per night in each neighbourhood",x="Neighbourhood",y="Price per night")+
# theme_bw() comes first so that it does not reset the axis-text settings below
theme_bw()+
theme(axis.text.x=element_text(angle=70, size=8, vjust=0.6))+
NULL
Interpretation: Looking at the average rental price in each neighbourhood, it becomes clear that Indre By is by far the neighborhood with the highest average rental price (~DKK 1490 per night). The cheapest Airbnbs, on average, are in Bispebjerg. Looking at the range of prices in each area, we observe some large outliers (probably due to incorrect reporting by the hosts), which significantly influence the average prices.
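Since outliers pull the mean upwards, comparing the mean with the median per neighbourhood shows how much a few extreme prices matter. The sketch below illustrates this on invented data with two hypothetical areas "A" and "B"; it is not a result from the Copenhagen listings.

```r
# Toy data, invented for illustration: area "A" contains one extreme price
toy <- data.frame(
  neighbourhood = c("A", "A", "A", "B", "B", "B"),
  price         = c(900, 1000, 50000, 700, 800, 900)
)

mean_price   <- tapply(toy$price, toy$neighbourhood, mean)
median_price <- tapply(toy$price, toy$neighbourhood, median)

summary_tbl <- data.frame(
  neighbourhood = names(mean_price),
  mean          = as.numeric(mean_price),
  median        = as.numeric(median_price)
)

# A large gap between mean and median points to influential outliers
summary_tbl$gap <- summary_tbl$mean - summary_tbl$median
print(summary_tbl)
```

In the toy data the single extreme listing dominates the mean of area "A" while leaving its median close to the typical price, which is why the median is often the more robust summary for price data.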
To start with, our target variable will be the cost for two people to stay at an Airbnb location for four nights.
#creating our output variable "price_4_nights"
regression_df_1 <- listings_prop %>%
filter(accommodates >= 2, minimum_nights <= 4, maximum_nights >= 4) %>%
mutate(price_4_nights = price*4) %>%
filter(!is.na(price_4_nights)) # removing any missing values, but none seem to exist
Note: To define price_4_nights, we filtered for rooms that accommodate at least 2 people (>=2), because 2 people can also stay in a room with a capacity of 3 or more people.
Secondly, since our guests want to stay for 4 nights, we kept listings whose minimum nights requirement is less than or equal to 4 (<=4), because our guests can stay at places with a minimum nights requirement between 1 and 4.
Moreover, we kept listings whose maximum nights value is greater than or equal to 4 (>=4), because our guests have to be able to stay for at least 4 nights. The logic behind these filters is that we want to drop pricing data for listings that cannot meet our guests' requirements.
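To make the effect of the three conditions concrete, the sketch below applies the same filters to five invented listings. The `id` column and all values are hypothetical; only the filter conditions and the `price * 4` definition come from the chunk above.

```r
library(dplyr)

# Toy listings, invented for illustration
toy <- tibble(
  id             = 1:5,
  accommodates   = c(1, 2, 4, 2, 3),
  minimum_nights = c(1, 2, 7, 4, 3),
  maximum_nights = c(30, 3, 365, 10, 1125),
  price          = c(400, 600, 1500, 800, 1000)
)

eligible <- toy %>%
  # id 1 fails accommodates >= 2, id 2 fails maximum_nights >= 4,
  # id 3 fails minimum_nights <= 4
  filter(accommodates >= 2, minimum_nights <= 4, maximum_nights >= 4) %>%
  mutate(price_4_nights = price * 4)

print(eligible)
```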
regression_df_1 %>%
ggplot(aes(price_4_nights)) +
geom_histogram(color="black", fill="grey")+
theme_bw()+
geom_density(alpha=0.5) +
NULL
regression_df_2 <- regression_df_1 %>%
mutate(log_price_4_nights = log(price_4_nights))
regression_df_2 %>%
ggplot(aes(log_price_4_nights)) +
geom_histogram(aes(y = after_stat(density)), color="black", fill="pink")+ # density scale so the geom_density() overlay is visible
theme_bw()+
geom_density(alpha=0.2) +
NULL
Interpretation:
Going forward we should use log_price_4_nights. As seen above, taking the log of price_4_nights makes the variable roughly normally distributed. Strictly speaking, OLS does not require the variables themselves to be normally distributed; the normality assumption concerns the error terms. However, a heavily right-skewed dependent variable tends to produce skewed residuals and gives a few very expensive listings undue influence on the fit. Put simply, by logging the variable, we are reducing the skewness of price_4_nights.
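To illustrate how a log transform reduces right skew, here is a minimal sketch on simulated data. The prices are drawn from a log-normal distribution with arbitrary parameters, not taken from the listings data, and `skew` is a small helper defined here for illustration.

```r
set.seed(1)

# Simulated right-skewed prices (hypothetical, not the Airbnb data).
price <- rlnorm(1000, meanlog = 8, sdlog = 0.5)

# Sample skewness: third central moment divided by the cubed standard deviation.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skew(price)       # clearly positive for log-normal data
skew_log <- skew(log(price))  # close to zero after the transform

print(c(raw = skew_raw, logged = skew_log))
```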
model1 <- lm(log_price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating,
data = regression_df_2)
summary(model1)
Call:
lm(formula = log_price_4_nights ~ prop_type_simplified + number_of_reviews +
review_scores_rating, data = regression_df_2)
Residuals:
Min 1Q Median 3Q Max
-2.5087 -0.3199 -0.0543 0.2607 4.7275
Coefficients:
Estimate Std. Error
(Intercept) 8.2123361 0.0616266
prop_type_simplifiedEntire rental unit 0.0400720 0.0196014
prop_type_simplifiedEntire residential home 0.3175498 0.0357001
prop_type_simplifiedOther -0.2407395 0.0234648
prop_type_simplifiedPrivate room in residential home -0.5872703 0.0694131
number_of_reviews -0.0003745 0.0001664
review_scores_rating -0.0054367 0.0122712
t value Pr(>|t|)
(Intercept) 133.260 <2e-16 ***
prop_type_simplifiedEntire rental unit 2.044 0.0410 *
prop_type_simplifiedEntire residential home 8.895 <2e-16 ***
prop_type_simplifiedOther -10.260 <2e-16 ***
prop_type_simplifiedPrivate room in residential home -8.461 <2e-16 ***
number_of_reviews -2.250 0.0245 *
review_scores_rating -0.443 0.6577
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5094 on 6455 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.07019, Adjusted R-squared: 0.06933
F-statistic: 81.22 on 6 and 6455 DF, p-value: < 2.2e-16
car::vif(model1)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.027885 4 1.003444
number_of_reviews 1.024992 1 1.012419
review_scores_rating 1.005517 1 1.002755
autoplot(model1)+ theme_bw()
regression_df_2 %>%
group_by(prop_type_simplified) %>%
summarise(count=n())
| prop_type_simplified | count |
|---|---|
| Entire condominium (condo) | 970 |
| Entire rental unit | 4647 |
| Entire residential home | 292 |
| Other | 1440 |
| Private room in residential home | 65 |
Interpretation:
Having run a simple OLS regression, we can see that review_scores_rating is not a significant explanatory variable of log_price_4_nights at the 5% significance level. This can be deduced from the t value close to zero (-0.443) and the correspondingly high p-value (0.658). Thus, when controlling for property type, review scores do not seem to affect prices in this simple model. This seems intuitive because a property’s review score should not directly affect a listing’s price but rather the willingness of a customer to book said listing. Hence, we would expect it to have a direct relationship with something like occupancy rate.
Moreover, we can see that all dummy variables derived from “prop_type_simplified” are significant at least at the 5% significance level. Thus, they all affect our dependent variable log_price_4_nights. This was to be expected as the size of a rental unit should be a critical contributing factor in determining price. When interpreting the sign of our property types we need to remind ourselves of the base case, which is “Entire condominium (condo)”. In light of this it makes sense that “private” and “other” rooms come at a discount while “entire” rental units and residential homes come at a premium.
We also observe that “number_of_reviews” is significant at the 5% significance level with a t-value of -2.25. However, the effect on price appears negligible when compared to property types.
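Because the dependent variable is logged, a dummy coefficient b corresponds to a price difference of roughly 100 * (exp(b) - 1) percent relative to the base case. The sketch below applies this standard conversion to the property-type estimates copied from the model1 summary; the variable names `coefs` and `pct_change` are introduced here for illustration only.

```r
# Coefficients copied from the model1 summary (log scale, base case: Entire condominium).
coefs <- c(
  "Entire rental unit"               =  0.0400720,
  "Entire residential home"          =  0.3175498,
  "Other"                            = -0.2407395,
  "Private room in residential home" = -0.5872703
)

# Percentage price difference relative to the base category.
pct_change <- 100 * (exp(coefs) - 1)

print(round(pct_change, 1))
```

Under this reading, an entire residential home is priced roughly 37% above a comparable condominium in model1, while a private room in a residential home is priced roughly 44% below it.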
model2 <- lm(log_price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data = regression_df_2)
summary(model2)
Call:
lm(formula = log_price_4_nights ~ prop_type_simplified + number_of_reviews +
review_scores_rating + room_type, data = regression_df_2)
Residuals:
Min 1Q Median 3Q Max
-2.5202 -0.2798 -0.0423 0.2304 4.8236
Coefficients:
Estimate Std. Error
(Intercept) 8.2318293 0.0568499
prop_type_simplifiedEntire rental unit 0.0363823 0.0180717
prop_type_simplifiedEntire residential home 0.3183529 0.0329133
prop_type_simplifiedOther 0.3452995 0.0278946
prop_type_simplifiedPrivate room in residential home 0.3616344 0.0699177
number_of_reviews 0.0002402 0.0001547
review_scores_rating -0.0117452 0.0113216
room_typeHotel room -0.2837109 0.1197797
room_typePrivate room -0.9668193 0.0287123
room_typeShared room -0.6852324 0.1676615
t value Pr(>|t|)
(Intercept) 144.799 < 2e-16 ***
prop_type_simplifiedEntire rental unit 2.013 0.0441 *
prop_type_simplifiedEntire residential home 9.672 < 2e-16 ***
prop_type_simplifiedOther 12.379 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home 5.172 2.38e-07 ***
number_of_reviews 1.553 0.1206
review_scores_rating -1.037 0.2996
room_typeHotel room -2.369 0.0179 *
room_typePrivate room -33.673 < 2e-16 ***
room_typeShared room -4.087 4.42e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4697 on 6452 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.2101, Adjusted R-squared: 0.209
F-statistic: 190.6 on 9 and 6452 DF, p-value: < 2.2e-16
car::vif(model2)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.569426 4 1.125199
number_of_reviews 1.042645 1 1.021100
review_scores_rating 1.007005 1 1.003496
room_type 2.602351 3 1.172810
autoplot(model2)+ theme_bw()
regression_df_2 %>%
group_by(room_type) %>%
summarise(count=n())
| room_type | count |
|---|---|
| Entire home/apt | 6486 |
| Hotel room | 17 |
| Private room | 901 |
| Shared room | 10 |
Interpretation:
From the above we can see that all types of room have a statistically significant negative effect on log_price_4_nights at the 5% level. Private and shared room are both significant at the 1% level. The significant effects were to be expected as the type of room (private vs. shared) should be a big contributing factor in determining price. The signs of our variables have to be interpreted in conjunction with our base case. As can be seen from our room_type overview table, the base case is “Entire home/apt”, thus, it makes sense that all other room types command a reduction in price.
Moreover, as expected, adding room_type as an additional explanatory variable changes the estimates for our other coefficients. While all property types remain significant at the 5% level, number of reviews becomes insignificant. Overall, when controlling for property and room type, both review ratings and number of reviews do not significantly affect our output variable.
There are a couple of things to note here: review_scores_rating and number_of_reviews are both relatively insignificant across both models and seem to have only negligible effects. Thus, they do not appear to be explanatory variables of log_price_4_nights. However, for our subsequent analysis we will keep them as control variables, since leaving them in the residual might otherwise induce omitted variable bias. After all, beyond size and type of property, price will be most importantly affected by the underlying quality of the listing, for which reviews tend to be a good indicator. Thus, going forward, we will keep them in our model specification.
When looking at the VIF of prop_type_simplified we see that it increases noticeably (GVIF from 1.03 to 2.57) when introducing room_type. This makes sense, since there is some overlap between these two variables in that both of them give an indication of whether a listing is private or shared. However, the VIF is still small and within the acceptable range of below 5. Moreover, our adjusted R-squared goes up by a large margin, from roughly 7% to 21%. Thus, adding room_type drastically improves the explanatory power of our model, which is why we will keep it as a regressor going forward.
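For factor terms with more than one dummy, car::vif reports a generalised VIF together with GVIF^(1/(2*Df)), which adjusts for the number of coefficients the term contributes. The following sketch reproduces the third column of the model2 output from the first two; the vectors are copied by hand from that output and the object names are illustrative.

```r
# GVIF and Df values copied from car::vif(model2).
gvif <- c(
  prop_type_simplified = 2.569426,
  number_of_reviews    = 1.042645,
  review_scores_rating = 1.007005,
  room_type            = 2.602351
)
df <- c(4, 1, 1, 3)

# Adjusted GVIF: comparable across terms with different degrees of freedom.
adj_gvif <- gvif^(1 / (2 * df))

print(round(adj_gvif, 6))
```

For a term with a single degree of freedom the adjusted value is simply the square root of the ordinary VIF, so squaring it gives a number on the familiar VIF scale.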
From the residual plots for Model 1 and Model 2 we can, however, clearly see that there seems to be a relationship within the data that has not been accounted for yet. Moreover, both scale-location plots show a clustering similar to that in the residual plots, indicating that variability is not the same for all levels of price. This is why we will now investigate further explanatory variables that may or may not improve our model’s explanatory power.
Question: Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables?
Note: In our data set, the number of bathrooms was not recorded across all observations. Thus, we cannot analyse this variable. Moreover, intuitively, the number of people a listing can accommodate will closely correlate with the listing’s number of beds and bedrooms. We will therefore first estimate a regression model that includes all three variables and then investigate their relationships further through a correlation analysis. We will then reason why we only keep one of them.
model3_0 <- lm(log_price_4_nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
beds +
bedrooms +
accommodates,
data=regression_df_2
)
msummary(model3_0)
Estimate Std. Error
(Intercept) 7.799e+00 5.341e-02
prop_type_simplifiedEntire rental unit 1.586e-02 1.628e-02
prop_type_simplifiedEntire residential home -1.006e-01 3.126e-02
prop_type_simplifiedOther 1.406e-01 2.561e-02
prop_type_simplifiedPrivate room in residential home 1.059e-01 6.366e-02
number_of_reviews -3.008e-05 1.395e-04
review_scores_rating -1.507e-02 1.041e-02
room_typeHotel room -5.468e-02 1.071e-01
room_typePrivate room -6.094e-01 2.747e-02
room_typeShared room -4.199e-01 1.497e-01
beds 7.102e-03 5.586e-03
bedrooms 1.160e-01 1.041e-02
accommodates 7.912e-02 5.738e-03
t value Pr(>|t|)
(Intercept) 146.005 < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.974 0.32986
prop_type_simplifiedEntire residential home -3.219 0.00129 **
prop_type_simplifiedOther 5.489 4.2e-08 ***
prop_type_simplifiedPrivate room in residential home 1.663 0.09627 .
number_of_reviews -0.216 0.82929
review_scores_rating -1.448 0.14770
room_typeHotel room -0.510 0.60982
room_typePrivate room -22.182 < 2e-16 ***
room_typeShared room -2.806 0.00504 **
beds 1.271 0.20362
bedrooms 11.145 < 2e-16 ***
accommodates 13.789 < 2e-16 ***
Residual standard error: 0.4183 on 6306 degrees of freedom
(1095 observations deleted due to missingness)
Multiple R-squared: 0.3735, Adjusted R-squared: 0.3723
F-statistic: 313.2 on 12 and 6306 DF, p-value: < 2.2e-16
autoplot(model3_0)+ theme_bw()
car::vif(model3_0)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 3.014135 4 1.147877
number_of_reviews 1.054560 1 1.026918
review_scores_rating 1.007912 1 1.003948
room_type 2.891021 3 1.193553
beds 2.492727 1 1.578837
bedrooms 3.337548 1 1.826896
accommodates 3.585671 1 1.893587
regression_df_2 %>%
select(c(accommodates, beds, bedrooms)) %>%
ggpairs(alpha = 0.3)
Interpretation:
Our regression that includes all 3 indicators of size shows that beds is insignificant at the 5% significance level. Moreover, when looking at the VIF, we can see that beds, bedrooms and accommodates have the highest adjusted values GVIF^(1/(2*Df)) of all regressors (1.58 to 1.89). We expect that by removing two out of three variables, the VIF of the remaining variable will be significantly lower.
From the correlation analysis we can see that there is a relatively strong correlation across all three variables. This seems intuitive as they all give an indication of the exact same thing, namely the size of the listing. Therefore, going forward, we will only use accommodates as a proxy for size, as it has the highest correlation with the other two variables.
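The choice of a single size proxy can also be made numerically rather than by reading the pairs plot. The sketch below uses simulated data (the relationships between the three variables are invented for illustration) and base R's `cor()` to compute, for each variable, its mean correlation with the other two.

```r
set.seed(42)
n <- 200

# Simulated size variables (hypothetical, not the listings data).
bedrooms     <- sample(1:4, n, replace = TRUE)
beds         <- bedrooms + sample(0:1, n, replace = TRUE)
accommodates <- 2 * bedrooms + sample(0:1, n, replace = TRUE)

size_vars <- data.frame(accommodates, beds, bedrooms)

# Pairwise correlations; rows with missing values are dropped pair by pair.
cor_mat <- cor(size_vars, use = "pairwise.complete.obs")

# Mean absolute correlation of each variable with the other two.
mean_other <- (rowSums(abs(cor_mat)) - 1) / (ncol(cor_mat) - 1)

print(round(cor_mat, 2))
print(round(mean_other, 2))
```

On the real data the same two calls on `regression_df_2[, c("accommodates", "beds", "bedrooms")]` would give the figures behind the pairs plot.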
model3 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
accommodates,
data=regression_df_2
)
msummary(model3)
Estimate Std. Error
(Intercept) 7.7777527 0.0523895
prop_type_simplifiedEntire rental unit 0.0217315 0.0162460
prop_type_simplifiedEntire residential home -0.0298486 0.0308851
prop_type_simplifiedOther 0.1749336 0.0254437
prop_type_simplifiedPrivate room in residential home 0.1494879 0.0630701
number_of_reviews -0.0001015 0.0001393
review_scores_rating -0.0086904 0.0101754
room_typeHotel room -0.1529111 0.1077020
room_typePrivate room -0.6378892 0.0271346
room_typeShared room -0.4974142 0.1507595
accommodates 0.1317883 0.0033617
t value Pr(>|t|)
(Intercept) 148.460 < 2e-16 ***
prop_type_simplifiedEntire rental unit 1.338 0.181056
prop_type_simplifiedEntire residential home -0.966 0.333860
prop_type_simplifiedOther 6.875 6.77e-12 ***
prop_type_simplifiedPrivate room in residential home 2.370 0.017808 *
number_of_reviews -0.729 0.466319
review_scores_rating -0.854 0.393103
room_typeHotel room -1.420 0.155726
room_typePrivate room -23.508 < 2e-16 ***
room_typeShared room -3.299 0.000974 ***
accommodates 39.203 < 2e-16 ***
Residual standard error: 0.4221 on 6451 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.362, Adjusted R-squared: 0.3611
F-statistic: 366.1 on 10 and 6451 DF, p-value: < 2.2e-16
car::vif(model3)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.919242 4 1.143296
number_of_reviews 1.046742 1 1.023104
review_scores_rating 1.007064 1 1.003526
room_type 2.878473 3 1.192688
accommodates 1.227559 1 1.107953
autoplot(model3)
Interpretation:
From the regression output above, we can see that the VIF of accommodates has dropped significantly (GVIF from 3.59 to 1.23). This means that by removing beds and bedrooms we appear to have adequately addressed potential issues of multicollinearity. Additionally, accommodates is a highly significant variable with a t-value of 39.2. This makes sense since the price of a listing will inevitably be affected by the number of people that can stay there. The sign of the effect is positive, which also seems intuitive as one would expect a larger property to command a higher price.
Question: Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
model4 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
accommodates +
host_is_superhost,
data=regression_df_2
)
msummary(model4)
Estimate Std. Error
(Intercept) 7.7770591 0.0528731
prop_type_simplifiedEntire rental unit 0.0216563 0.0162519
prop_type_simplifiedEntire residential home -0.0298041 0.0308931
prop_type_simplifiedOther 0.1748245 0.0254566
prop_type_simplifiedPrivate room in residential home 0.1491981 0.0630982
number_of_reviews -0.0001104 0.0001452
review_scores_rating -0.0086089 0.0103041
room_typeHotel room -0.1525479 0.1077354
room_typePrivate room -0.6380485 0.0271534
room_typeShared room -0.4972827 0.1507944
accommodates 0.1317985 0.0033635
host_is_superhostTRUE 0.0033037 0.0149302
t value Pr(>|t|)
(Intercept) 147.089 < 2e-16 ***
prop_type_simplifiedEntire rental unit 1.333 0.18273
prop_type_simplifiedEntire residential home -0.965 0.33471
prop_type_simplifiedOther 6.868 7.14e-12 ***
prop_type_simplifiedPrivate room in residential home 2.365 0.01808 *
number_of_reviews -0.760 0.44706
review_scores_rating -0.835 0.40348
room_typeHotel room -1.416 0.15684
room_typePrivate room -23.498 < 2e-16 ***
room_typeShared room -3.298 0.00098 ***
accommodates 39.185 < 2e-16 ***
host_is_superhostTRUE 0.221 0.82488
Residual standard error: 0.4222 on 6448 degrees of freedom
(954 observations deleted due to missingness)
Multiple R-squared: 0.362, Adjusted R-squared: 0.3609
F-statistic: 332.6 on 11 and 6448 DF, p-value: < 2.2e-16
car::vif(model4)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.922263 4 1.143444
number_of_reviews 1.135800 1 1.065739
review_scores_rating 1.018817 1 1.009365
room_type 2.882026 3 1.192934
accommodates 1.227787 1 1.108055
host_is_superhost 1.118070 1 1.057388
autoplot(model4)
Interpretation:
From the above output we can see a couple of things: First, host_is_superhost is not a significant explanatory variable for price_4_nights at the 5% significance level. Secondly, while VIF is not showing any signs of collinearity, R2 has not changed compared to our latest model (model3), adjusted R2 falls marginally (from 0.3611 to 0.3609) and the RMSE even slightly increases (from 0.4221 to 0.4222). We also do not see any changes in our residual and normal Q-Q plots. This all indicates that host_is_superhost is not adding any new information to the model. We also do not suspect that host_is_superhost accounts for any omitted variable bias, which would make it necessary to include it as a control variable. Overall, for the reasons above we will drop this variable going forward.
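Whether an added regressor carries new information can also be checked with a nested-model F-test via `anova()`. The sketch below is run on simulated data in which the extra dummy is unrelated to price by construction; the variable names and parameter values are hypothetical and only loosely mimic the report's setting.

```r
set.seed(123)
n <- 500

# Simulated data: price depends on size only, the dummy is pure noise.
accommodates <- sample(2:8, n, replace = TRUE)
superhost    <- rbinom(n, 1, 0.2) == 1
log_price    <- 7.8 + 0.13 * accommodates + rnorm(n, sd = 0.4)

small <- lm(log_price ~ accommodates)
large <- lm(log_price ~ accommodates + superhost)

# F-test of the restriction that the added coefficient is zero.
cmp <- anova(small, large)
print(cmp)

# Adjusted R-squared penalises the extra parameter.
print(c(small = summary(small)$adj.r.squared,
        large = summary(large)$adj.r.squared))
```

Both models must be fitted to the same rows for this comparison to be valid, which matters in the report because model4 drops two more observations than model3 due to missing values.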
Question: Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
model5 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
accommodates +
instant_bookable,
data=regression_df_2
)
msummary(model5)
Estimate Std. Error
(Intercept) 7.7778975 0.0525623
prop_type_simplifiedEntire rental unit 0.0217313 0.0162473
prop_type_simplifiedEntire residential home -0.0298759 0.0308977
prop_type_simplifiedOther 0.1749135 0.0254524
prop_type_simplifiedPrivate room in residential home 0.1494817 0.0630753
number_of_reviews -0.0001013 0.0001394
review_scores_rating -0.0087098 0.0101917
room_typeHotel room -0.1526379 0.1080024
room_typePrivate room -0.6378270 0.0271968
room_typeShared room -0.4973088 0.1508023
accommodates 0.1317959 0.0033692
instant_bookableTRUE -0.0004678 0.0135925
t value Pr(>|t|)
(Intercept) 147.975 < 2e-16 ***
prop_type_simplifiedEntire rental unit 1.338 0.18109
prop_type_simplifiedEntire residential home -0.967 0.33362
prop_type_simplifiedOther 6.872 6.92e-12 ***
prop_type_simplifiedPrivate room in residential home 2.370 0.01782 *
number_of_reviews -0.727 0.46748
review_scores_rating -0.855 0.39281
room_typeHotel room -1.413 0.15762
room_typePrivate room -23.452 < 2e-16 ***
room_typeShared room -3.298 0.00098 ***
accommodates 39.118 < 2e-16 ***
instant_bookableTRUE -0.034 0.97254
Residual standard error: 0.4221 on 6450 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.362, Adjusted R-squared: 0.361
F-statistic: 332.8 on 11 and 6450 DF, p-value: < 2.2e-16
car::vif(model5)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.923420 4 1.143501
number_of_reviews 1.048260 1 1.023846
review_scores_rating 1.010129 1 1.005052
room_type 2.903501 3 1.194411
accommodates 1.232844 1 1.110335
instant_bookable 1.020213 1 1.010056
autoplot(model5)
Interpretation:
From the above output we can see that instant_bookable is not a significant explanatory variable for price_4_nights at the 5% significance level. Moreover, as for host_is_superhost, we do not see any improvement in the explanatory power of our model, as R2 and RMSE do not change. Given that (instant_bookable == FALSE) only affects a very small fraction of our observations, this was to be expected. Regarding collinearity, there seems to be no significant correlation between instant_bookable and the other explanatory variables, since all GVIF values are below 5. For these reasons, we will not consider this variable going forward.
Determining whether location is a predictor of price_4_nights.
regression_df_clean_neighbourhood <- regression_df_2 %>%
mutate(neighbourhood_cleansed_simplified = case_when(
neighbourhood_cleansed %in% c("Indre By",
"Vesterbro-Kongens Enghave",
"Nørrebro","Østerbro",
"Frederiksberg",
"Amager Vest") ~ neighbourhood_cleansed,
TRUE ~ "Other"
))
regression_df_clean_neighbourhood %>%
count(neighbourhood_cleansed_simplified) %>%
arrange(desc(n))
| neighbourhood_cleansed_simplified | n |
|---|---|
| Other | 1386 |
| Indre By | 1374 |
| Vesterbro-Kongens Enghave | 1253 |
| Nørrebro | 1206 |
| Østerbro | 760 |
| Frederiksberg | 739 |
| Amager Vest | 696 |
model6 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
accommodates +
neighbourhood_cleansed_simplified,
data=regression_df_clean_neighbourhood
)
msummary(model6)
Estimate
(Intercept) 7.7667492
prop_type_simplifiedEntire rental unit 0.0086385
prop_type_simplifiedEntire residential home 0.1004274
prop_type_simplifiedOther 0.1376155
prop_type_simplifiedPrivate room in residential home 0.2509233
number_of_reviews -0.0005483
review_scores_rating -0.0047721
room_typeHotel room -0.2417652
room_typePrivate room -0.5815066
room_typeShared room -0.3961472
accommodates 0.1254778
neighbourhood_cleansed_simplifiedFrederiksberg 0.0415391
neighbourhood_cleansed_simplifiedIndre By 0.2968230
neighbourhood_cleansed_simplifiedNørrebro -0.0399046
neighbourhood_cleansed_simplifiedOther -0.1981164
neighbourhood_cleansed_simplifiedØsterbro 0.0422716
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.0461336
Std. Error t value
(Intercept) 0.0515102 150.781
prop_type_simplifiedEntire rental unit 0.0152074 0.568
prop_type_simplifiedEntire residential home 0.0296323 3.389
prop_type_simplifiedOther 0.0239592 5.744
prop_type_simplifiedPrivate room in residential home 0.0595529 4.213
number_of_reviews 0.0001315 -4.169
review_scores_rating 0.0095271 -0.501
room_typeHotel room 0.1008744 -2.397
room_typePrivate room 0.0255720 -22.740
room_typeShared room 0.1411430 -2.807
accommodates 0.0031598 39.711
neighbourhood_cleansed_simplifiedFrederiksberg 0.0228024 1.822
neighbourhood_cleansed_simplifiedIndre By 0.0202756 14.639
neighbourhood_cleansed_simplifiedNørrebro 0.0205713 -1.940
neighbourhood_cleansed_simplifiedOther 0.0199886 -9.911
neighbourhood_cleansed_simplifiedØsterbro 0.0227066 1.862
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.0205580 2.244
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.570025
prop_type_simplifiedEntire residential home 0.000705 ***
prop_type_simplifiedOther 9.68e-09 ***
prop_type_simplifiedPrivate room in residential home 2.55e-05 ***
number_of_reviews 3.10e-05 ***
review_scores_rating 0.616462
room_typeHotel room 0.016572 *
room_typePrivate room < 2e-16 ***
room_typeShared room 0.005020 **
accommodates < 2e-16 ***
neighbourhood_cleansed_simplifiedFrederiksberg 0.068547 .
neighbourhood_cleansed_simplifiedIndre By < 2e-16 ***
neighbourhood_cleansed_simplifiedNørrebro 0.052446 .
neighbourhood_cleansed_simplifiedOther < 2e-16 ***
neighbourhood_cleansed_simplifiedØsterbro 0.062699 .
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.024861 *
Residual standard error: 0.3948 on 6445 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.4423, Adjusted R-squared: 0.4409
F-statistic: 319.5 on 16 and 6445 DF, p-value: < 2.2e-16
car::vif(model6)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 3.232083 4 1.157938
number_of_reviews 1.066288 1 1.032612
review_scores_rating 1.008979 1 1.004480
room_type 2.939444 3 1.196862
accommodates 1.239477 1 1.113318
neighbourhood_cleansed_simplified 1.172223 6 1.013330
autoplot(model6)
Interpretation:
Immediately we notice how, compared to our current preferred model (model3), our adjusted R2 increases by roughly 0.08 and our RMSE falls from 0.4221 to 0.3948. Moreover, all of our neighbourhood dummy variables are significant explanatory variables at the 10% level. The large impact on the explanatory power of our model when adding this variable makes intuitive sense. As with real estate, prices for accommodation will vary by area. Some neighbourhoods command a price premium due to their proximity to the city centre, others because the neighbourhood is clean and affluent.
Additionally, multicollinearity seems to be no issue given that all GVIFs remain below 5 and neighbourhood_cleansed_simplified itself only has a GVIF of 1.17. Lastly, when looking at the residuals vs fitted and scale-location plots we can see that they are almost random. This effect on our model was also to be expected given neighbourhood’s explanatory power and the categorical distribution across listings. For model 3, one could almost witness patterns of vertical lines. Adding information on location has seemingly removed a large part of these remaining patterns.
Hence, going forward, model6 is now our preferred model.
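The neighbourhood coefficients are measured against a reference category. Amager Vest is the only neighbourhood without its own coefficient in the model6 output, which suggests it is the base case, since `lm()` uses the first factor level as the reference. The sketch below, on a small invented vector, shows how that default arises and how `relevel()` could be used to pick a different reference such as "Other"; this is an illustration, not a step taken in the report.

```r
# Hypothetical neighbourhood labels for a handful of listings.
neighbourhood <- c("Indre By", "Amager Vest", "Other", "Frederiksberg", "Indre By")

# factor() orders levels alphabetically; lm() treats the first level as the base case.
f <- factor(neighbourhood)
base_default <- levels(f)[1]

# relevel() moves a chosen level to the front, changing the reference category.
f_other <- relevel(f, ref = "Other")
base_other <- levels(f_other)[1]

print(c(default = base_default, releveled = base_other))
```

Changing the reference category alters the individual dummy coefficients and their p-values but not the overall fit of the model.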
Note: availability_30 is defined as the availability of the listing 30 days in the future as determined by the calendar.
model7 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
accommodates +
neighbourhood_cleansed_simplified+
availability_30,
data=regression_df_clean_neighbourhood
)
msummary(model7)
Estimate
(Intercept) 7.5357940
prop_type_simplifiedEntire rental unit 0.0097386
prop_type_simplifiedEntire residential home 0.0999257
prop_type_simplifiedOther 0.1270435
prop_type_simplifiedPrivate room in residential home 0.2199067
number_of_reviews -0.0004349
review_scores_rating 0.0227773
room_typeHotel room -0.5402829
room_typePrivate room -0.6107811
room_typeShared room -0.6143250
accommodates 0.1260141
neighbourhood_cleansed_simplifiedFrederiksberg 0.0547302
neighbourhood_cleansed_simplifiedIndre By 0.2729825
neighbourhood_cleansed_simplifiedNørrebro -0.0112098
neighbourhood_cleansed_simplifiedOther -0.1901813
neighbourhood_cleansed_simplifiedØsterbro 0.0377509
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.0704718
availability_30 0.0168311
Std. Error t value
(Intercept) 0.0481318 156.566
prop_type_simplifiedEntire rental unit 0.0140600 0.693
prop_type_simplifiedEntire residential home 0.0273964 3.647
prop_type_simplifiedOther 0.0221537 5.735
prop_type_simplifiedPrivate room in residential home 0.0550673 3.993
number_of_reviews 0.0001217 -3.574
review_scores_rating 0.0088475 2.574
room_typeHotel room 0.0936979 -5.766
room_typePrivate room 0.0236590 -25.816
room_typeShared room 0.1306594 -4.702
accommodates 0.0029214 43.135
neighbourhood_cleansed_simplifiedFrederiksberg 0.0210856 2.596
neighbourhood_cleansed_simplifiedIndre By 0.0187596 14.552
neighbourhood_cleansed_simplifiedNørrebro 0.0190389 -0.589
neighbourhood_cleansed_simplifiedOther 0.0184820 -10.290
neighbourhood_cleansed_simplifiedØsterbro 0.0209937 1.798
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.0190210 3.705
availability_30 0.0005084 33.105
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.488558
prop_type_simplifiedEntire residential home 0.000267 ***
prop_type_simplifiedOther 1.02e-08 ***
prop_type_simplifiedPrivate room in residential home 6.59e-05 ***
number_of_reviews 0.000353 ***
review_scores_rating 0.010062 *
room_typeHotel room 8.48e-09 ***
room_typePrivate room < 2e-16 ***
room_typeShared room 2.63e-06 ***
accommodates < 2e-16 ***
neighbourhood_cleansed_simplifiedFrederiksberg 0.009464 **
neighbourhood_cleansed_simplifiedIndre By < 2e-16 ***
neighbourhood_cleansed_simplifiedNørrebro 0.556026
neighbourhood_cleansed_simplifiedOther < 2e-16 ***
neighbourhood_cleansed_simplifiedØsterbro 0.072191 .
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.000213 ***
availability_30 < 2e-16 ***
Residual standard error: 0.3651 on 6444 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.5234, Adjusted R-squared: 0.5221
F-statistic: 416.3 on 17 and 6444 DF, p-value: < 2.2e-16
car::vif(model7)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 3.233689 4 1.158010
number_of_reviews 1.067135 1 1.033022
review_scores_rating 1.017985 1 1.008953
room_type 2.974696 3 1.199243
accommodates 1.239515 1 1.113335
neighbourhood_cleansed_simplified 1.188643 6 1.014505
availability_30 1.047300 1 1.023377
autoplot(model7)
Interpretation:
We can see how, compared to our current preferred model (model6), our adjusted R2 increases even further, by roughly 0.08, and that our RMSE falls significantly as well (from 0.3948 to 0.3651). availability_30 is also a significant explanatory variable for price at the 1% level.
Multicollinearity seems to be no issue here either, with GVIFs remaining below 5 across the board. When looking at the residuals vs fitted and scale-location plots we can see that adding availability_30 has further reduced the remaining patterns in the residuals, which speaks to the linearity and constant-variance assumptions. Therefore, model7 is now our preferred model.
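For a numeric regressor in a log-price model, the coefficient scales with the size of the change considered. As a rough worked example, the sketch below converts the availability_30 estimate from the model7 summary into percentage price differences for 1, 10 and 30 additional available days, holding the other regressors fixed; the object names are illustrative.

```r
# Coefficient on availability_30 copied from the model7 summary (log scale).
b_avail <- 0.0168311

# Additional available days in the next 30 days.
days <- c(1, 10, 30)

# Implied percentage difference in the 4-night price.
pct <- 100 * (exp(b_avail * days) - 1)

print(data.frame(days = days, pct_change = round(pct, 1)))
```

This describes an association in the fitted model, roughly 1.7% per additional available day, and should not be read as a causal effect of availability on price.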
model8 <- lm(log(price_4_nights) ~ prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type +
accommodates +
neighbourhood_cleansed_simplified+
availability_30+
reviews_per_month,
data=regression_df_clean_neighbourhood
)
msummary(model8)
Estimate
(Intercept) 7.5480308
prop_type_simplifiedEntire rental unit -0.0018942
prop_type_simplifiedEntire residential home 0.0838592
prop_type_simplifiedOther 0.1187630
prop_type_simplifiedPrivate room in residential home 0.2172859
number_of_reviews -0.0001881
review_scores_rating 0.0246266
room_typeHotel room -0.5143128
room_typePrivate room -0.6092725
room_typeShared room -0.6064637
accommodates 0.1256042
neighbourhood_cleansed_simplifiedFrederiksberg 0.0541359
neighbourhood_cleansed_simplifiedIndre By 0.2790454
neighbourhood_cleansed_simplifiedNørrebro -0.0087199
neighbourhood_cleansed_simplifiedOther -0.1865979
neighbourhood_cleansed_simplifiedØsterbro 0.0382876
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.0725633
availability_30 0.0169401
reviews_per_month -0.0211813
Std. Error t value
(Intercept) 0.0480596 157.056
prop_type_simplifiedEntire rental unit 0.0141703 -0.134
prop_type_simplifiedEntire residential home 0.0274710 3.053
prop_type_simplifiedOther 0.0221457 5.363
prop_type_simplifiedPrivate room in residential home 0.0549327 3.955
number_of_reviews 0.0001287 -1.461
review_scores_rating 0.0088314 2.789
room_typeHotel room 0.0935747 -5.496
room_typePrivate room 0.0236019 -25.815
room_typeShared room 0.1303428 -4.653
accommodates 0.0029150 43.089
neighbourhood_cleansed_simplifiedFrederiksberg 0.0210336 2.574
neighbourhood_cleansed_simplifiedIndre By 0.0187427 14.888
neighbourhood_cleansed_simplifiedNørrebro 0.0189966 -0.459
neighbourhood_cleansed_simplifiedOther 0.0184467 -10.116
neighbourhood_cleansed_simplifiedØsterbro 0.0209419 1.828
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.0189773 3.824
availability_30 0.0005075 33.378
reviews_per_month 0.0036839 -5.750
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.893665
prop_type_simplifiedEntire residential home 0.002278 **
prop_type_simplifiedOther 8.48e-08 ***
prop_type_simplifiedPrivate room in residential home 7.72e-05 ***
number_of_reviews 0.144079
review_scores_rating 0.005310 **
room_typeHotel room 4.03e-08 ***
room_typePrivate room < 2e-16 ***
room_typeShared room 3.34e-06 ***
accommodates < 2e-16 ***
neighbourhood_cleansed_simplifiedFrederiksberg 0.010082 *
neighbourhood_cleansed_simplifiedIndre By < 2e-16 ***
neighbourhood_cleansed_simplifiedNørrebro 0.646234
neighbourhood_cleansed_simplifiedOther < 2e-16 ***
neighbourhood_cleansed_simplifiedØsterbro 0.067553 .
neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave 0.000133 ***
availability_30 < 2e-16 ***
reviews_per_month 9.35e-09 ***
Residual standard error: 0.3641 on 6443 degrees of freedom
(952 observations deleted due to missingness)
Multiple R-squared: 0.5258, Adjusted R-squared: 0.5245
F-statistic: 396.9 on 18 and 6443 DF, p-value: < 2.2e-16
car::vif(model8)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 3.310160 4 1.161398
number_of_reviews 1.200643 1 1.095738
review_scores_rating 1.019337 1 1.009622
room_type 2.981912 3 1.199727
accommodates 1.240257 1 1.113668
neighbourhood_cleansed_simplified 1.196300 6 1.015048
availability_30 1.048762 1 1.024091
reviews_per_month 1.185969 1 1.089022
autoplot(model8)
Interpretation:
From the above, we can see reviews_per_month only marginally improves the explanatory power of our model. Computing AICs in the following section will help us identify whether model8 should be preferred over model7.
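As a reminder of what the AIC comparison measures, here is a minimal sketch on simulated data. It fits two nested models with `lm()`, compares them with `AIC()`, and recomputes one value by hand from the log-likelihood; all variable names and parameter values are hypothetical.

```r
set.seed(7)
n <- 400

# Simulated data: x2 has only a weak effect on y.
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.1 * x2 + rnorm(n)

m_small <- lm(y ~ x1)
m_large <- lm(y ~ x1 + x2)

# Lower AIC is preferred; df counts the coefficients plus the error variance.
aic_tab <- AIC(m_small, m_large)
print(aic_tab)

# AIC by hand: -2 * log-likelihood + 2 * number of estimated parameters.
aic_manual <- -2 * as.numeric(logLik(m_small)) + 2 * attr(logLik(m_small), "df")
print(aic_manual)
```

AIC values are only comparable between models fitted to the same observations, which holds for model7 and model8 since both drop the same 952 rows.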
library(huxtable)
huxreg(model1, model2, model3, model4, model5, model6, model7, model8,
statistics = c('R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma',
'AIC' = 'AIC'),
bold_signif = 0.05
) %>%
set_caption('Comparison of models')

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 8.212 *** | 8.232 *** | 7.778 *** | 7.777 *** | 7.778 *** | 7.767 *** | 7.536 *** | 7.548 *** |
| | (0.062) | (0.057) | (0.052) | (0.053) | (0.053) | (0.052) | (0.048) | (0.048) |
| prop_type_simplifiedEntire rental unit | 0.040 * | 0.036 * | 0.022 | 0.022 | 0.022 | 0.009 | 0.010 | -0.002 |
| | (0.020) | (0.018) | (0.016) | (0.016) | (0.016) | (0.015) | (0.014) | (0.014) |
| prop_type_simplifiedEntire residential home | 0.318 *** | 0.318 *** | -0.030 | -0.030 | -0.030 | 0.100 *** | 0.100 *** | 0.084 ** |
| | (0.036) | (0.033) | (0.031) | (0.031) | (0.031) | (0.030) | (0.027) | (0.027) |
| prop_type_simplifiedOther | -0.241 *** | 0.345 *** | 0.175 *** | 0.175 *** | 0.175 *** | 0.138 *** | 0.127 *** | 0.119 *** |
| | (0.023) | (0.028) | (0.025) | (0.025) | (0.025) | (0.024) | (0.022) | (0.022) |
| prop_type_simplifiedPrivate room in residential home | -0.587 *** | 0.362 *** | 0.149 * | 0.149 * | 0.149 * | 0.251 *** | 0.220 *** | 0.217 *** |
| | (0.069) | (0.070) | (0.063) | (0.063) | (0.063) | (0.060) | (0.055) | (0.055) |
| number_of_reviews | -0.000 * | 0.000 | -0.000 | -0.000 | -0.000 | -0.001 *** | -0.000 *** | -0.000 |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) |
| review_scores_rating | -0.005 | -0.012 | -0.009 | -0.009 | -0.009 | -0.005 | 0.023 * | 0.025 ** |
| | (0.012) | (0.011) | (0.010) | (0.010) | (0.010) | (0.010) | (0.009) | (0.009) |
| room_typeHotel room | | -0.284 * | -0.153 | -0.153 | -0.153 | -0.242 * | -0.540 *** | -0.514 *** |
| | | (0.120) | (0.108) | (0.108) | (0.108) | (0.101) | (0.094) | (0.094) |
| room_typePrivate room | | -0.967 *** | -0.638 *** | -0.638 *** | -0.638 *** | -0.582 *** | -0.611 *** | -0.609 *** |
| | | (0.029) | (0.027) | (0.027) | (0.027) | (0.026) | (0.024) | (0.024) |
| room_typeShared room | | -0.685 *** | -0.497 *** | -0.497 *** | -0.497 *** | -0.396 ** | -0.614 *** | -0.606 *** |
| | | (0.168) | (0.151) | (0.151) | (0.151) | (0.141) | (0.131) | (0.130) |
| accommodates | | | 0.132 *** | 0.132 *** | 0.132 *** | 0.125 *** | 0.126 *** | 0.126 *** |
| | | | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) | (0.003) |
| host_is_superhostTRUE | | | | 0.003 | | | | |
| | | | | (0.015) | | | | |
| instant_bookableTRUE | | | | | -0.000 | | | |
| | | | | | (0.014) | | | |
| neighbourhood_cleansed_simplifiedFrederiksberg | | | | | | 0.042 | 0.055 ** | 0.054 * |
| | | | | | | (0.023) | (0.021) | (0.021) |
| neighbourhood_cleansed_simplifiedIndre By | | | | | | 0.297 *** | 0.273 *** | 0.279 *** |
| | | | | | | (0.020) | (0.019) | (0.019) |
| neighbourhood_cleansed_simplifiedNørrebro | | | | | | -0.040 | -0.011 | -0.009 |
| | | | | | | (0.021) | (0.019) | (0.019) |
| neighbourhood_cleansed_simplifiedOther | | | | | | -0.198 *** | -0.190 *** | -0.187 *** |
| | | | | | | (0.020) | (0.018) | (0.018) |
| neighbourhood_cleansed_simplifiedØsterbro | | | | | | 0.042 | 0.038 | 0.038 |
| | | | | | | (0.023) | (0.021) | (0.021) |
| neighbourhood_cleansed_simplifiedVesterbro-Kongens Enghave | | | | | | 0.046 * | 0.070 *** | 0.073 *** |
| | | | | | | (0.021) | (0.019) | (0.019) |
| availability_30 | | | | | | | 0.017 *** | 0.017 *** |
| | | | | | | | (0.001) | (0.001) |
| reviews_per_month | | | | | | | | -0.021 *** |
| | | | | | | | | (0.004) |
| R squared | 0.070 | 0.210 | 0.362 | 0.362 | 0.362 | 0.442 | 0.523 | 0.526 |
| Adj. R Squared | 0.069 | 0.209 | 0.361 | 0.361 | 0.361 | 0.441 | 0.522 | 0.524 |
| Residual SE | 0.509 | 0.470 | 0.422 | 0.422 | 0.422 | 0.395 | 0.365 | 0.364 |
| AIC | 9630.980 | 8583.554 | 7204.709 | 7206.343 | 7206.708 | 6347.539 | 5334.606 | 5303.533 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | | | | | | | | |
Interpretation:
As the table above shows, the computed AICs (Akaike Information Criterion) confirm our interpretations throughout. The AIC rewards goodness of fit but penalises every additional estimated parameter, so it favours the model that explains the most variation with the fewest explanatory variables. The smaller the AIC, the better. The AIC falls each time we move from one preferred model to the next, whereas model4 and model5, which add host_is_superhost and instant_bookable respectively, have a slightly higher AIC than model3. With the lowest AIC of all eight models (5303.5), model8 is our best overall model.
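As a minimal sketch of what the AIC measures, the snippet below recomputes it by hand as 2k - 2 * log-likelihood and compares two nested models. It uses R's built-in mtcars data rather than the listings data, so the variables (mpg, wt, hp) and the names fit_small, fit_large and manual_aic are illustrative assumptions and not part of our analysis.

```r
# Illustration on the built-in mtcars data (NOT the Airbnb listings)
fit_small <- lm(log(mpg) ~ wt, data = mtcars)
fit_large <- lm(log(mpg) ~ wt + hp, data = mtcars)

# AIC = 2k - 2 * log-likelihood, where k counts all estimated parameters
# (regression coefficients plus the residual variance)
manual_aic <- function(fit) {
  ll <- logLik(fit)
  k <- attr(ll, "df")
  2 * k - 2 * as.numeric(ll)
}

aic_table <- data.frame(
  model = c("fit_small", "fit_large"),
  k = c(attr(logLik(fit_small), "df"), attr(logLik(fit_large), "df")),
  aic_manual = c(manual_aic(fit_small), manual_aic(fit_large)),
  aic_builtin = c(AIC(fit_small), AIC(fit_large))
)
print(aic_table)
```

The hand-computed column should match AIC(). Note that AIC values are only comparable between models fitted to the same observations, which is worth checking when variables with missing values are added.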
Question: Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnb’s in your destination city that are apartments with a private room, have at least 10 reviews, and an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.
# Create a new data frame describing the hypothetical stay
staying <- tibble(
  prop_type_simplified = "Entire rental unit",
  number_of_reviews = 10,
  review_scores_rating = 90,
  room_type = "Private room",
  accommodates = 2,
  neighbourhood_cleansed_simplified = "Indre By",
  availability_30 = 5.867413,   # sample mean
  reviews_per_month = 0.9128969 # sample mean
)
# interval = "prediction" returns a 95% prediction interval by default
model_prediction <- data.frame(predict(model8, newdata = staying, interval = "prediction")) %>%
  # Undo the log transformation with exp() and convert DKK to USD at 0.16
  mutate(price_4_nights = exp(fit) * 0.16,
         CI_lower = exp(lwr) * 0.16,
         CI_upper = exp(upr) * 0.16) %>%
  select(-fit, -lwr, -upr)
model_prediction

| price_4_nights | CI_lower | CI_upper |
|---|---|---|
| 2.78e+03 | 538 | 1.43e+04 |
Interpretation:
For the prediction with model8, we specified a private room (room_type) in a property of type "Entire rental unit" (prop_type_simplified). We set the number of reviews to 10, the average review rating to 90 and accommodates to 2. We chose the neighbourhood with the largest price premium in our model (Indre By), and set availability_30 and reviews_per_month to their respective means.
Accounting for the log transformation of price and a current exchange rate of 0.16 USD per Danish krone, we get an expected price_4_nights of c.2776 USD, or c.694 USD per day. This is a lot, even for a stay in the most expensive part of Copenhagen, which is quite an expensive city in itself. Much of it is driven by the rating term: with a coefficient of 0.025, a rating of 90 adds about 2.25 to the log price, a factor of roughly 9.5. If review_scores_rating is recorded on a 5-point scale in this data set, a value of 90 lies far outside the observed range and the prediction should be treated as an extrapolation.
Our 95% prediction interval for price_4_nights ranges from c.538 USD to c.14,300 USD, i.e. from c.134 USD to c.3585 USD per day. This shows that there is still great variability around our model's predictions and hence the model would need to be improved further. Firstly, and most importantly, some prices are likely to have been recorded incorrectly at both ends of the distribution, which may heavily skew our results. This would require further investigation which, due to time constraints, was beyond the scope of this project.
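To illustrate why the interval for a single listing is so much wider than the uncertainty about the average price, here is a minimal sketch contrasting interval = "confidence" with interval = "prediction" for a log-linear model. It again uses the built-in mtcars data, and the names fit, new_car, conf_int and pred_int are illustrative assumptions, not objects from our analysis.

```r
# Illustration on the built-in mtcars data (NOT the Airbnb listings)
fit <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Hypothetical new observation
new_car <- data.frame(wt = 3, hp = 110)

# Interval for the mean response vs. interval for one new observation
conf_int <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)

# Back-transform from the log scale with exp(), as done for price_4_nights
intervals <- rbind(confidence = exp(conf_int[1, ]),
                   prediction = exp(pred_int[1, ]))
print(round(intervals, 2))
```

Both rows share the same point prediction, but the prediction interval also includes the residual variance of an individual observation, so it is always the wider of the two.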
Secondly, further analysis could test the remaining variables in the data set to see whether they improve our model's explanatory power. Also, we have not looked at any interactions between variables. The interaction between room type and neighbourhood, for example, would be an interesting one to examine, as it would tell us the effect of each room type within a specific neighbourhood. Moreover, one could also introduce data from outside the current set, e.g. a dummy variable indicating whether a hotel is within close range.
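A room type by neighbourhood interaction could be tested along the following lines. This is only a sketch on simulated data: the data frame toy, its two-level factors and the coefficients used to generate log_price are all made up for illustration and are not estimates from the listings data.

```r
set.seed(1)
n <- 400

# Simulated listings (hypothetical, NOT the Airbnb data)
toy <- data.frame(
  room_type = factor(sample(c("Entire home/apt", "Private room"), n, replace = TRUE)),
  neighbourhood = factor(sample(c("Indre By", "Other"), n, replace = TRUE)),
  accommodates = sample(1:6, n, replace = TRUE)
)

# Simulated log price: the private-room discount is larger outside Indre By
toy$log_price <- 7.5 + 0.12 * toy$accommodates -
  0.6 * (toy$room_type == "Private room") -
  0.2 * (toy$neighbourhood == "Other") -
  0.3 * (toy$room_type == "Private room" & toy$neighbourhood == "Other") +
  rnorm(n, sd = 0.3)

# Additive model vs. model with a room_type x neighbourhood interaction
fit_additive <- lm(log_price ~ room_type + neighbourhood + accommodates, data = toy)
fit_interaction <- lm(log_price ~ room_type * neighbourhood + accommodates, data = toy)

# F-test of the nested models and AIC comparison
comparison <- anova(fit_additive, fit_interaction)
print(comparison)
print(AIC(fit_additive, fit_interaction))
```

In the real data the same idea would amount to replacing `room_type + neighbourhood_cleansed_simplified` with `room_type * neighbourhood_cleansed_simplified` in the model formula, at the cost of many more coefficients, some of which may rest on very few listings.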
Thirdly, this is only a very simple regression model of price. One could investigate whether estimators other than OLS would make more sense for our specific scenario. Also, if we had time series data on Airbnb prices within Copenhagen, we could account for factors such as seasonality, which would allow us to determine the best time of the year to travel to Copenhagen.